Zeno v2 #166

Draft
wants to merge 303 commits into
base: main
Conversation

equals215
Member

No description provided.

CorentinB and others added 2 commits February 7, 2025 15:25
…iting-at-finish

Log remaining WARC writing while finishing
* add: streaming postprocessing for JSON

* add: streaming postprocessing for S3

* add: TestIsS3

* add: TestS3 + S3 extraction refactoring

* add: streaming postprocessing for XML

* fix: avoid 2-layer assets extraction when HTML is wrongly discovered as asset

* add: use Zeno's User-Agent and custom HTTP client when requesting exclusion file

* add: error handling when doing SetReadDeadline

* fix: extraction from <script> content

* fix: outlinks extraction

---------

Co-authored-by: Corentin Barreau <[email protected]>
@CorentinB CorentinB added the v2 Label to use when describing an issue for Zeno v2 label Feb 7, 2025
vbanos and others added 27 commits February 7, 2025 20:34
The current code looks for `base` tag but doesn't stop if it finds one.
It will still search until the end of the doc.

The suggested improvement uses `doc.Find("base").First()` to get just
the first element as `<base>` is used just once on the HTML doc header.

We also add a unit test.
Simplify hopsToPath and pathToHops
Optimise extractBaseTag and add unit test
Group video[src] and audio[src] selections in the same `goquery.Find`
query because their handling is identical.

Group the scan of all doc elements for the attributes `[data-item], [style], [data-preview]`
into the same `goquery.Find` query.
The previous query `goquery.Find('*')` was returning all elements and
then we checked for specific attributes.
The new query returns only the elements which have one of the specified
attributes, so it should be much faster.

Add unit tests to validate the suggested improvements.
Refactor HTMLAssets and add unit tests
Add missing status 303 "See Other"
https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/303

Refactor to make it more readable.
Improve isStatusCodeRedirect
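
For illustration, a redirect check that includes 303 could look like the sketch below. Only the function name `isStatusCodeRedirect` and the addition of 303 come from the commit message; the signature and the exact set of other status codes are assumptions, not the code from this PR.

```go
package main

import (
	"fmt"
	"net/http"
)

// isStatusCodeRedirect reports whether code is an HTTP redirect status.
// Assumption: the set below (301, 302, 303, 307, 308) is illustrative;
// the PR only states that 303 "See Other" was missing and has been added.
func isStatusCodeRedirect(code int) bool {
	switch code {
	case http.StatusMovedPermanently, // 301
		http.StatusFound, // 302
		http.StatusSeeOther, // 303
		http.StatusTemporaryRedirect, // 307
		http.StatusPermanentRedirect: // 308
		return true
	default:
		return false
	}
}

func main() {
	for _, code := range []int{200, 301, 303, 404} {
		fmt.Printf("%d redirect=%t\n", code, isStatusCodeRedirect(code))
	}
}
```

A `switch` over the named `net/http` constants keeps the list readable and makes a missing status easy to spot in review.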
* controler: make makeStageChannel() capable of creating buffered and unbuffered channels

* Rework preprocessor concurrency (#211)

* preprocessor: using fan-in-fan-out pattern instead of dynamic workers pattern ; controler: make the reactor output channel buffered of size WorkersCount

* preprocessor: log wording consistency

* Rework archiver concurrency (#212)

* archiver: using fan-in-fan-out pattern instead of dynamic workers pattern

* cmd,config,archiver: rename MaxConcurrentAssets to MaxConcurrentAssetsPerWorker to make it more explicit that this limit is (to be) enforced PER worker

* Revert "cmd,config,archiver: rename MaxConcurrentAssets to MaxConcurrentAssetsPerWorker to make it more explicit that this limit is (to be) enforced PER worker"

This reverts commit 175af1e.

* preprocessor: use struct pointer for worker() method instead of global variable

* preprocessor: replace preprocessor.run by preprocessor.worker in the fieldedLogger

* Rework postprocessor concurrency (#214)

* postprocessor: using fan-in-fan-out pattern instead of dynamic workers pattern

* controler: make archiver and preprocessor channel buffered by size of WorkersCount

* archiver: check if context is done before passing seeds to the next stage

* Rework finisher concurrency (#219)

* stats: add counters for Finisher routines

* controler: make postprocessor, finisherFinish and finisherProduce chans buffered by size WorkersCount ; consume and discard finisherFinish and finisherProduce when HQ is not used

* finisher: make the finisher concurrent using fan-in-fan-out pattern
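
The fan-in-fan-out pattern named in the commits above can be sketched roughly as follows. This is a generic illustration, not Zeno's code: `makeStageChannel` borrows its name from the commit message but its signature here is an assumption, and `runStage` and `sumSquares` are hypothetical helpers that pass plain integers instead of crawl items.

```go
package main

import (
	"fmt"
	"sync"
)

// makeStageChannel returns a buffered channel when size > 0 and an
// unbuffered one otherwise. Hypothetical signature for illustration.
func makeStageChannel(size int) chan int {
	if size > 0 {
		return make(chan int, size)
	}
	return make(chan int)
}

// runStage starts a fixed number of workers that all read from in (fan-out)
// and all write to one shared output channel (fan-in). The output channel
// is buffered by the worker count and closed once every worker has returned.
func runStage(in <-chan int, workers int, process func(int) int) <-chan int {
	out := makeStageChannel(workers)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for item := range in {
				out <- process(item)
			}
		}()
	}
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

// sumSquares feeds 1..n through a single stage and adds up the results.
func sumSquares(n, workers int) int {
	in := makeStageChannel(workers)
	go func() {
		defer close(in)
		for i := 1; i <= n; i++ {
			in <- i
		}
	}()
	total := 0
	for v := range runStage(in, workers, func(x int) int { return x * x }) {
		total += v
	}
	return total
}

func main() {
	fmt.Println(sumSquares(10, 4)) // prints 385
}
```

Compared with spawning workers dynamically, a fixed pool bounds concurrency up front, and buffering each stage channel by the worker count lets every worker hand off one result without blocking on the next stage.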
* Drop item.seed attribute

Drop `seed` from the `NewItem` constructor.

Replace `item.seed` with `item.IsSeed`

Drop some checks involving `seed` and `parent` in `CheckConsistency`.

* Drop the seed param from all calls to NewItem

* Drop the seed param from all unit tests

Also, remove 3 unit tests which became irrelevant due to the drop of the
seed attribute.
…the logic of routing the logs, rotatedFile implements io.Writer interface
Rework log to use `samber/slog-multi` as a `log/slog` routing abstraction
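
To illustrate the routing idea only, here is a minimal fan-out `slog.Handler` written with the standard library. It does not use or reproduce the `samber/slog-multi` API; `fanoutHandler` and `demo` are hypothetical names, and the `bytes.Buffer` values stand in for any `io.Writer`, such as a rotating file.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"log/slog"
)

// fanoutHandler forwards each record to every wrapped handler that is
// enabled for the record's level. Hypothetical type for illustration.
type fanoutHandler struct {
	handlers []slog.Handler
}

func (f fanoutHandler) Enabled(ctx context.Context, level slog.Level) bool {
	for _, h := range f.handlers {
		if h.Enabled(ctx, level) {
			return true
		}
	}
	return false
}

func (f fanoutHandler) Handle(ctx context.Context, r slog.Record) error {
	var errs []error
	for _, h := range f.handlers {
		if h.Enabled(ctx, r.Level) {
			// Clone so that handlers do not share the record's attribute storage.
			errs = append(errs, h.Handle(ctx, r.Clone()))
		}
	}
	return errors.Join(errs...)
}

func (f fanoutHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	next := make([]slog.Handler, len(f.handlers))
	for i, h := range f.handlers {
		next[i] = h.WithAttrs(attrs)
	}
	return fanoutHandler{handlers: next}
}

func (f fanoutHandler) WithGroup(name string) slog.Handler {
	next := make([]slog.Handler, len(f.handlers))
	for i, h := range f.handlers {
		next[i] = h.WithGroup(name)
	}
	return fanoutHandler{handlers: next}
}

// demo logs one info and one error record through two sinks and returns
// the number of lines each sink received.
func demo() (allLines, errorLines int) {
	var all, errs bytes.Buffer
	logger := slog.New(fanoutHandler{handlers: []slog.Handler{
		slog.NewTextHandler(&all, &slog.HandlerOptions{Level: slog.LevelInfo}),
		slog.NewTextHandler(&errs, &slog.HandlerOptions{Level: slog.LevelError}),
	}})
	logger.Info("crawl started")
	logger.Error("fetch failed")
	return bytes.Count(all.Bytes(), []byte("\n")), bytes.Count(errs.Bytes(), []byte("\n"))
}

func main() {
	allLines, errorLines := demo()
	fmt.Println(allLines, errorLines)
}
```

Because each sink only needs to be an `io.Writer`, the routing layer stays independent of where the bytes end up.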
* init commit to start the PR

* models.url: moved tests from utils package to models package and added a concurrency test for upcoming changes

* models.url: implemented @yzqzss idea to cache result of URLToString to reduce number of calls
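
One way to cache a URL's string form so that concurrent callers compute it only once is sketched below. The `URL` type, its fields, and `NewURL` are hypothetical stand-ins rather than the `models` package from this PR, and `sync.Once` is an assumed mechanism; the commits do not say how the cache is implemented.

```go
package main

import (
	"fmt"
	"net/url"
	"sync"
)

// URL wraps a parsed URL and memoizes its string form.
// Hypothetical type for illustration only.
type URL struct {
	parsed *url.URL
	once   sync.Once
	cached string
	builds int // how many times the string was actually built (demo only)
}

// NewURL parses raw and returns a URL whose string form is built lazily.
func NewURL(raw string) (*URL, error) {
	p, err := url.Parse(raw)
	if err != nil {
		return nil, err
	}
	return &URL{parsed: p}, nil
}

// String builds the string form on the first call and returns the cached
// value afterwards. sync.Once makes this safe for concurrent callers.
func (u *URL) String() string {
	u.once.Do(func() {
		u.builds++
		u.cached = u.parsed.String()
	})
	return u.cached
}

func main() {
	u, err := NewURL("https://example.com/a?b=c")
	if err != nil {
		panic(err)
	}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = u.String()
		}()
	}
	wg.Wait()
	fmt.Println(u.String(), u.builds)
}
```

A cache like this is only valid while the underlying URL is never mutated after construction; a concurrency test run with the race detector is a reasonable way to check that.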
Labels
v2 Label to use when describing an issue for Zeno v2
5 participants